Abstract

We present an algorithm which attains $O(\sqrt{T})$ internal (and thus external) regret for finite games
with partial monitoring under the local observability condition. Recently, this condition has been shown
by Bartok, Pal, and Szepesvari [4] to imply the $O(\sqrt{T})$ rate for partial monitoring games against an i.i.d.
opponent, and the authors conjectured that the same holds for non-stochastic adversaries. Our result
settles this conjecture in the affirmative, and it completes the characterization of possible rates for finite
partial-monitoring games, an open question stated by Cesa-Bianchi, Lugosi, and Stoltz [8]. Our regret
guarantees also hold for the more general model of partial monitoring with random signals.



1 Introduction 

Imagine playing a repeated zero-sum game against an opponent (column player) where the loss is defined by
a given matrix $L \in \mathbb{R}^{N \times M}$. Unlike the classical full-information scenario, however, we (the row player) do not
observe the moves of the opponent and instead receive some signal given by the known matrix $H \in \Sigma^{N \times M}$
defined over some alphabet $\Sigma$. Specifically, for the choices $i$ and $j$ of the row and column players, the row
player observes the signal $H_{i,j}$. Neither the move of the opponent nor the incurred loss $L_{i,j}$ is observed by
the row player. In this paper, we are concerned with rates for external and internal regret achievable in this
scenario.

The question of characterizing such rates in terms of the matrices $L$ and $H$ has been raised by
Cesa-Bianchi, Lugosi, and Stoltz [8]. Under a linear dependence between the matrices $L$ and $H$, the authors
proved $O(T^{2/3})$ rates for external regret, yet noted that there exist games with the $\Theta(\sqrt{T})$ behavior (e.g.
the so-called bandit feedback games where $L = H$). A similar distinction in available rates also appears to hold
for internal regret: an $O(T^{2/3})$ upper bound was shown in [8], while the rate of $O(\sqrt{T})$ is achievable for
bandit feedback by the result of Blum and Mansour [5].

Recently, Bartok, Pal, and Szepesvari [4] made key insights into the problem of partial monitoring. In
particular, [4] characterized the rates for external regret against an i.i.d. (stochastic) opponent. The authors
showed that rates can only be one of $\Theta(1)$, $\Theta(\sqrt{T})$, $\Theta(T^{2/3})$, and $\Theta(T)$, and that a so-called local observability
condition plays a key role in determining this growth behavior. In the non-stochastic (adversarial) case,
however, no general characterization is available to date, with the notable exception of games with two
adversarial actions [3]. As suggested by [4], to provide a complete characterization for external regret
against non-stochastic opponents, it would be enough to show an upper bound of $O(\sqrt{T})$ under the local
observability condition. The characterization would follow because [4] proves an $\Omega(T^{2/3})$ lower bound when
local observability does not hold (yet the game is not hopeless with $\Omega(T)$ regret) and the upper bound of
$O(T^{2/3})$ is achieved by the algorithm of Piccolboni and Schindelhauer [10] through the analysis of [8].

This paper presents an algorithm, Neighborhood Watch, with an upper bound of $O(\sqrt{T})$ for both internal
and external regret against a non-stochastic opponent under the local observability condition. Together with
the results mentioned above, this completes the characterization for both internal and external regret. It is
remarkable that the condition of local observability that characterizes games against a stochastic environment
also characterizes games against non-stochastic opponents.

We now summarize our approach. First, we define a notion of local internal regret which postulates that 
the player does not benefit by switching any of its actions to a neighboring action. The neighbor relation is 
defined by the neighborhood graph of best responses to mixed strategies of the opponent. Second, we show 
that small local internal regret implies small (global) internal regret. We then present an algorithm which 
randomly chooses a neighborhood and then chooses an action in the neighborhood. A key property satisfied 
by the two-level procedure is a certain flow condition. Under this condition, external regret of sub-algorithms 
on local neighborhoods can be turned into a statement about local internal regret (and, hence, global internal 
regret). External regret of the sub-algorithms, in turn, can be upper bounded because the local observability
condition allows us to estimate relative losses of neighboring actions.

2 Notation and definitions 

We follow the notation of [4]. Let $\ell_i$ denote the $i$th row of $L$. Without loss of generality, assume that each
row of $H$ contains unique sets of symbols. Let $\sigma_1, \dots, \sigma_{s_i}$ be the list of symbols in the $i$th row of $H$. The
signal matrix $S_i \in \{0,1\}^{s_i \times M}$ is defined by $S_i(k,j) = \mathbb{I}\{H_{i,j} = \sigma_k\}$, where $\mathbb{I}\{\cdot\}$ is the indicator function. For
a pair $i, k$ of actions define $S_{(i,k)} \in \{0,1\}^{(s_i + s_k) \times M}$ by stacking $S_i$ on top of $S_k$. Note that, upon playing
action $i$, the signal $H_{i,j}$ arising from the unobserved action $j$ is equivalent to the feedback $S_i e_j$.
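The construction of the signal matrices can be made concrete with a short Python sketch. The matrix `H` below is a made-up example, and the helper names (`signal_matrix`, `feedback`, `stack`) are hypothetical and introduced only for illustration; symbols are listed in first-seen order, which is one of many valid orderings.

```python
def signal_matrix(H_row):
    """Return (symbols, S) with S[k][j] = 1 if H_row[j] is the k-th distinct symbol."""
    symbols = []
    for sym in H_row:                      # distinct symbols in first-seen order
        if sym not in symbols:
            symbols.append(sym)
    S = [[1 if entry == sym else 0 for entry in H_row] for sym in symbols]
    return symbols, S


def feedback(S, j):
    """Column j of S, that is S e_j: a one-hot encoding of the observed symbol."""
    return [row[j] for row in S]


def stack(S_i, S_k):
    """S_(i,k): the rows of S_i followed by the rows of S_k."""
    return S_i + S_k


# Hypothetical signal matrix: 2 player actions, 3 opponent actions.
H = [["a", "a", "b"],
     ["c", "c", "c"]]
symbols0, S0 = signal_matrix(H[0])   # S0 = [[1, 1, 0], [0, 0, 1]]
symbols1, S1 = signal_matrix(H[1])   # S1 = [[1, 1, 1]]
S01 = stack(S0, S1)                  # 3 x 3 stacked matrix
```

Reading off the position of the single 1 in `feedback(S0, j)` recovers the symbol $H_{0,j}$, which is the sense in which the signal and the vector $S_i e_j$ carry the same information.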

Let $\mathcal{C} = \{C_1, \dots, C_N\}$ be a partition of the simplex $\Delta_M$ according to the best response (action) of the
player to the mixed strategy of the adversary:

$$C_i = \{q \in \Delta_M : i \text{ is a best response for } q\}.$$

We assume that no action is completely dominated by others; that is, each $C_i$ is non-empty. Further, for
simplicity we assume that $\mathcal{C}$ is indeed a partition and there are no degeneracies (we can modify the argument
by defining neighborhood action sets as in [4]). Neighboring actions are naturally defined as those that share
a boundary in the partition. Let $\mathcal{G}$ be the graph obtained by connecting the neighboring cells of the partition
$\mathcal{C}$. The vertex set of $\mathcal{G}$ is precisely the set $\{1, \dots, N\}$ of the player's actions. For each action $i$, let the set of
its neighbors $N_i$ be called the neighbor set. By convention, any vertex is its own neighbor: $i \in N_i$. We will
often use the terms action and vertex interchangeably thanks to the one-to-one correspondence.

Definition 2.1 (Bartok, Pal, Szepesvari [4]). The game is called locally observable if $\ell_i - \ell_j \in \operatorname{Im} S_{(i,j)}^\top$ for
all neighboring actions $i, j$.

Under the local observability condition, for each pair of neighboring actions $i, j$ there exists a vector $v_{(i,j)}$ such
that $\ell_j - \ell_i = S_{(i,j)}^\top v_{(i,j)}$. Since $L$ and $H$ are known, we can compute the vectors $v_{(i,j)}$ and use them to construct
unbiased estimates of true loss differences.
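To illustrate the role of the vectors $v_{(i,j)}$, the following Python sketch checks the identity $\ell_j - \ell_i = S_{(i,j)}^\top v_{(i,j)}$ on a small made-up game. The matrices `L`, `S0`, `S1` and the witness `v01` are assumptions chosen by hand for this example, and `is_witness` is a hypothetical helper; in general the witness would be obtained by solving a linear system.

```python
def transpose_times(S, v):
    """Compute S^T v, where S is a list of rows and v has one entry per row."""
    return [sum(S[k][c] * v[k] for k in range(len(S))) for c in range(len(S[0]))]


def is_witness(L, i, j, S_ij, v, tol=1e-12):
    """Check the identity l_j - l_i = S_(i,j)^T v coordinate by coordinate."""
    target = [L[j][c] - L[i][c] for c in range(len(L[0]))]
    return all(abs(a - b) <= tol for a, b in zip(transpose_times(S_ij, v), target))


# Hypothetical game: action 0 reveals the outcome, action 1 reveals nothing.
L = [[0.0, 1.0, 2.0],
     [1.0, 1.0, 0.0]]
S0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # signal matrix of action 0
S1 = [[1, 1, 1]]                         # signal matrix of action 1
S01 = S0 + S1                            # S_(0,1): S0 stacked on S1

v01 = [1.0, 0.0, -2.0, 0.0]              # hand-picked witness for the pair (0, 1)
ok = is_witness(L, 0, 1, S01, v01)

# The quantity v^T S_(0,1) e_c recovers L[1][c] - L[0][c] for every outcome c.
recovered = [sum(v01[k] * S01[k][c] for k in range(4)) for c in range(3)]
```

In this toy game the loss difference is read entirely from the signal of action 0, so the last coordinate of the witness (the one acting on the uninformative action 1) can be set to zero.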

Notation. Let $[N]$ denote the set $\{1, \dots, N\}$. For a subset $S \subseteq [N]$ we use $\mathbf{1}_S \in \{0,1\}^N$ to denote the
vector with ones on the coordinates in $S$ and zeros outside. A vector $a \in \mathbb{R}^N$ indexed by $j$ is sometimes
denoted by $[a_j]_{j \in [N]}$. The scalar product between two vectors $a$ and $b$ will be variously written as $a^\top b$ or
$a \cdot b$. Standard basis vectors are denoted by $\{e_i\}$.

3 Internal Regret in the Neighborhood 

Let $\phi : \{1, \dots, N\} \to \{1, \dots, N\}$ be a departure function [6], and let $I_t$ and $j_t$ denote the moves at time $t$
of the player and the opponent, respectively. At the end of the game, regret with respect to $\phi$ is calculated
as the difference of the incurred cumulative cost and the cost that would have been incurred had we played
action $\phi(I_t)$ instead of $I_t$, for all $t$. Let $\Phi$ be a set of departure functions. $\Phi$-regret is defined as

$$\sum_{t=1}^{T} c(I_t, j_t) - \min_{\phi \in \Phi} \sum_{t=1}^{T} c(\phi(I_t), j_t),$$

where the cost function considered in this paper is simply $c(i,j) = e_i^\top L e_j$. If $\Phi = \{\phi_k : k \in [N]\}$ consists of
constant mappings $\phi_k(i) = k$, the regret is called external. For (global) internal regret, the set $\Phi$ consists of
all departure functions $\phi_{i \to j}$ such that $\phi_{i \to j}(i) = j$ and $\phi_{i \to j}(h) = h$ for $h \neq i$.
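The definitions of external and internal regret can be spelled out in a few lines of Python. This is a minimal sketch under stated assumptions: regret is taken unnormalized (a sum over rounds, matching the form of the bound in Theorem 4.1), a departure function is represented as a list `phi` with `phi[i]` the replacement for action `i`, and the loss matrix and play sequence are made up. The internal class below also contains the identity map (the case $j = i$), so the computed internal regret is never negative.

```python
def phi_regret(L, plays, outcomes, departures):
    """Incurred cumulative cost minus the best cost over the departure functions."""
    incurred = sum(L[i][j] for i, j in zip(plays, outcomes))
    best = min(sum(L[phi[i]][j] for i, j in zip(plays, outcomes))
               for phi in departures)
    return incurred - best


def external_departures(N):
    """Constant maps phi_k(i) = k."""
    return [[k] * N for k in range(N)]


def internal_departures(N):
    """Maps phi_{i->j}: send i to j and leave every other action unchanged."""
    maps = []
    for i in range(N):
        for j in range(N):
            phi = list(range(N))
            phi[i] = j
            maps.append(phi)
    return maps


# Hypothetical 2 x 2 loss matrix and a short play sequence.
L = [[0, 1],
     [1, 0]]
plays = [0, 0, 1, 1]        # I_t, the player's actions
outcomes = [1, 1, 1, 0]     # j_t, the opponent's actions
external = phi_regret(L, plays, outcomes, external_departures(2))
internal = phi_regret(L, plays, outcomes, internal_departures(2))
```

Local internal regret, introduced next, would use the same `phi_regret` routine with the list restricted to maps $\phi_{i \to j}$ for which $j$ is a neighbor of $i$.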

Definition 3.1. A departure function $\phi_{i \to j}$ is called a local departure function if $j$ is a neighbor of $i$ in the
neighborhood graph $\mathcal{G}$. Regret defined with respect to the set of all local departure functions is called local
internal regret.

Under the local observability condition, we can estimate the differences in performance between an
action and its neighbors in a way similar to non-stochastic bandit methods. We can, therefore, ensure that
any time we chose an action, its loss was not much more than that of any of its neighbors. That is, the local
observability condition leads to an algorithm with no external regret and, under the flow condition detailed
later, no local internal regret. A key observation is that no local internal regret implies no global internal
regret. Intuitively, this stems from the fact that the second-best-response action must be a neighbor of the
best-response action. Hence, ensuring small internal regret against the neighbors is enough to guarantee
small internal regret.




Figure 1: Illustration of the argument in Lemma 3.1. A second-best action must either be a neighbor, or it
must be dominated everywhere by other actions.



Lemma 3.1. Local internal regret is equal to internal regret. 

Proof. It is enough to show that, for any distribution $q \in \Delta_M$, any best response $i_1$ and any second-best
response $i_2$ are neighbors in the graph $\mathcal{G}$. By way of contradiction, we assume that actions $i_1$ and $i_2$
are not neighbors (that is, $C_{i_1}$ and $C_{i_2}$ do not share a face). We will then arrive at the conclusion that $i_2$
must be dominated by other actions, which is a contradiction because of our assumption that no action is
completely dominated (that is, minorized) by others.

Let $g(s) = \min_{i \in [N]} e_i^\top L s$ be the minimum loss against the mixed strategy $s$. Since $g$ is a minimum of $N$
linear functions $\{f_k(s) = (e_k^\top L) \cdot s\}_{k=1}^{N}$, it is concave and piecewise linear. The linear parts of $g$ correspond to
the elements of the partition $\mathcal{C}$. By our assumption, $f_{i_1}(q) < f_{i_2}(q)$ and there is no hyperplane $f_{i_3}$ achieving
at $q$ a value in the interval $(f_{i_1}(q), f_{i_2}(q))$. Let

$$S = \{(s,t) \in \mathbb{R}^{M+1} : t = f_{i_1}(s) = f_{i_2}(s) \text{ for some } s \in \Delta_M\},$$

the intersection of two hyperplanes over the simplex. Note that the projection of $S$ onto the simplex would
be precisely the boundary separating $C_{i_1}$ and $C_{i_2}$ if these were the only two actions. This set cannot be
empty, for otherwise action $i_2$ is dominated by $i_1$. Now, pick any $p \in \Delta_M$ such that $f_{i_1}(p) = f_{i_2}(p)$, and
let $a = (p, f_{i_1}(p))$ (see Figure 1). We will now work with the one-dimensional problem along the line in the
simplex defined by $(q,p)$. The fact that $i_1$ and $i_2$ are not neighbors along the direction $(q,p)$ means that
there is another action $i_3$ such that $f_{i_3}(p) < f_{i_1}(p) = f_{i_2}(p)$. Since $f_{i_3}(q) > f_{i_2}(q)$, there must be a
point $b = (r, f_{i_3}(r)) = (r, f_{i_2}(r))$ of intersection of $f_{i_3}$ and $f_{i_2}$ for some $r \in [q,p]$. It is easy to see that $i_2$ is
completely minorized along the direction $(q,p)$: on one side of $r$ it is dominated by $i_1$, while on the other
it is dominated by $i_3$.

The argument above works for any direction from $q$ towards the boundary between $C_{i_1}$ and $C_{i_2}$ if $i_1$ and
$i_2$ were the only actions. Hence, $i_2$ is globally dominated by other actions, a contradiction. □

Figure 2: To each vertex $i$ in the graph $\mathcal{G}$ we associate an algorithm $\mathcal{A}_i$. The algorithm plays an action
from the distribution $q_i^t$ over its neighborhood set $N_i$ and receives partial information about relative loss
between the node $i$ and its neighbor. The other piece of the partial information comes from the times when
a neighboring algorithm $\mathcal{A}_j$ is run and the action $i$ is picked.

Algorithm 1 Neighborhood Watch Algorithm

1: For all $i \in \{1, \dots, N\}$, initialize algorithm $\mathcal{A}_i$ with $q_i^1 = x_i^1 = \mathbf{1}_{N_i} / |N_i|$
2: for $t = 1, \dots, T$ do
3:   Let $Q^t = [q_1^t, \dots, q_N^t]$, where $q_i^t$ is furnished by $\mathcal{A}_i$
4:   Find $p^t$ satisfying $p^t = Q^t p^t$
5:   Draw $k_t$ from $p^t$
6:   Play $I_t$ drawn from $q_{k_t}^t$ and obtain signal $S_{I_t} e_{j_t}$
7:   Run local algorithm $\mathcal{A}_{k_t}$ with the received signal
8:   For any $i \neq k_t$, $q_i^{t+1} \leftarrow q_i^t$
9: end for

4 Method 

The method is a two-level procedure motivated by Foster and Vohra [7] and Blum and Mansour [5]. The
intuition stems from the following observation. Suppose for each vertex $i$ we have a distribution $q_i \in \Delta_N$
supported on the neighbor set $N_i$. Let $p \in \Delta_N$ be defined by $p = Qp$ where $Q$ is the matrix $[q_1, \dots, q_N]$.
Then there are two equivalent ways of sampling an action from $p$. The first is to directly sample the vertex
according to $p$. The second is to sample a vertex $i$ according to $p$ and then choose a vertex $j$ within the neighbor
set $N_i$ according to $q_i$. Because of the stationarity (or flow) condition $p = Qp$, the two ways are equivalent.
This idea of finding a fixed point is implicit in [7], and Blum and Mansour [5] show how stationarity can be
used to convert external regret guarantees into an internal regret statement. We show here that, in fact, this
conversion can be done "locally" and only with "comparison" information between neighboring actions.
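The equivalence of the two sampling schemes under $p = Qp$ can be checked numerically. The sketch below is illustrative only: the paper does not prescribe how the fixed point is computed, and the lazy power iteration used here is an assumption (one of several standard ways to find a stationary distribution), as are the three-vertex path graph and the distributions `cols`.

```python
def apply_Q(cols, p):
    """Return Q p, where cols[i] is the column q_i of Q (a distribution over actions)."""
    n = len(p)
    return [sum(p[i] * cols[i][j] for i in range(n)) for j in range(n)]


def stationary(cols, iters=10_000, tol=1e-13):
    """Approximate a fixed point p = Q p by iterating p <- (p + Q p) / 2."""
    n = len(cols)
    p = [1.0 / n] * n
    for _ in range(iters):
        Qp = apply_Q(cols, p)
        new_p = [(a + b) / 2 for a, b in zip(p, Qp)]   # lazy step avoids oscillation
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical path graph 0 - 1 - 2; each q_i is supported on the neighbor set N_i.
cols = [[0.7, 0.3, 0.0],     # q_0, supported on N_0 = {0, 1}
        [0.2, 0.5, 0.3],     # q_1, supported on N_1 = {0, 1, 2}
        [0.0, 0.6, 0.4]]     # q_2, supported on N_2 = {1, 2}
p = stationary(cols)

# Two-level sampling: P(I = j) = sum_i p_i q_i(j), which is exactly (Q p)_j.
two_level_marginal = apply_Q(cols, p)
```

For this example the fixed point is $(4/13, 6/13, 3/13)$, and `two_level_marginal` coincides with `p` up to numerical tolerance, which is the flow condition in action.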

Our procedure is as follows. We run $N$ different algorithms $\mathcal{A}_1, \dots, \mathcal{A}_N$, each corresponding to a vertex
and its neighbor set. Within this neighbor set we obtain small regret because we can construct estimates of
loss differences among the actions, thanks to the local observability condition. Each algorithm $\mathcal{A}_i$ produces
a distribution $q_i^t \in \Delta_N$ at round $t$, reflecting the relative performance of the vertex $i$ and its neighbors.
Since $\mathcal{A}_i$ is only concerned with its local neighborhood, we require that $q_i^t$ has support on $N_i$ and is zero
everywhere else. The meta algorithm Neighborhood Watch combines the distributions $Q^t = [q_1^t, \dots, q_N^t]$ and
computes $p^t$ as a fixed point

$$p^t = Q^t p^t. \tag{1}$$

How do we choose our actions? At each round, we draw $k_t \sim p^t$ and then $I_t \sim q_{k_t}^t$ according to our
two-level scheme. The action $I_t$ is the action we play in the partial monitoring game against the adversary.
Let the action played by the adversary at time $t$ be denoted by $j_t$. Then the feedback we obtain is $S_{I_t} e_{j_t}$.
This information is passed to $\mathcal{A}_{k_t}$, which updates the distribution $q_{k_t}^t$. In Section 4.2 we detail how this is
done.

Algorithm 2 Local Algorithm $\mathcal{A}_i$

1: If $t = 1$, initialize $s = 1$
2: For $r \in \{\tau_i(s-1)+1, \dots, \tau_i(s)\}$ (i.e. for all $r$ since the last time $\mathcal{A}_i$ was run) construct
   $$b_{(i,j)}^r = v_{(i,j)}^\top \begin{bmatrix} \mathbb{I}\{I_r = i\}\, S_i \\ \mathbb{I}\{k_r = i\}\,\mathbb{I}\{I_r = j\}\, S_j / q_i^r(j) \end{bmatrix} e_{j_r}$$
   for all $j \in N_i$
3: Define for all $j \in N_i$
   $$h_{(i,j)}^s = \sum_{r=\tau_i(s-1)+1}^{\tau_i(s)} b_{(i,j)}^r$$
   and let
   $$f_i^s = \big[ h_{(i,j)}^s \cdot \mathbb{I}\{j \in N_i\} \big]_{j \in [N]}$$
4: Pass the cost $f_i^s$ to a full-information online convex optimization algorithm over the simplex (e.g. Exponential Weights Algorithm) and receive the next distribution $x_i^{s+1}$ supported on $N_i$
5: Define
   $$q_i^{t+1} \leftarrow (1-\gamma)\, x_i^{s+1} + (\gamma / |N_i|)\, \mathbf{1}_{N_i}$$
6: Increase the count $s \leftarrow s + 1$

4.1 Main Result

The main result of the paper is the following internal regret guarantee.

Theorem 4.1. Local internal regret of Algorithm 1 is bounded as

$$\sup_{\phi}\; \mathbb{E}\left\{ \sum_{t=1}^{T} (e_{I_t} - e_{\phi(I_t)})^\top L e_{j_t} \right\} \le 4 N v \sqrt{6 (\log N) T}$$

where $v = \max_{(i,j)} \|v_{(i,j)}\|_\infty$ and the supremum is taken over all local departure functions.

The next corollary is immediate given Lemma 3.1.

Corollary 4.1. Internal regret of Algorithm 1 is also bounded as in Theorem 4.1.

We remark that high probability bounds can also be obtained in a rather straightforward manner, using,
for instance, the approach of [1]. Another extension, the case of random signals, is discussed in Section 5.

4.2 Estimating loss differences

The random variable $k_t$ drawn from $p^t$ at time $t$ determines which algorithm is active on the given round.
Let

$$\tau_i(s) = \min\Big\{ t : s = \sum_{r=1}^{t} \mathbb{I}\{k_r = i\} \Big\}$$

denote the (random) time when the algorithm $\mathcal{A}_i$ is invoked for the $s$-th time. By convention, $\tau_i(0) = 0$.
Further, define

$$\pi_i(t) = \min\{ t' \ge t : k_{t'} = i \}$$

to denote the next time the algorithm is run on or after time $t$. When invoked for the $s$-th time, the algorithm
$\mathcal{A}_i$ constructs estimates

$$b_{(i,j)}^r = v_{(i,j)}^\top \begin{bmatrix} \mathbb{I}\{I_r = i\}\, S_i \\ \mathbb{I}\{k_r = i\}\,\mathbb{I}\{I_r = j\}\, S_j / q_i^r(j) \end{bmatrix} e_{j_r}, \qquad \forall r \in \{\tau_i(s-1)+1, \dots, \tau_i(s)\},\ \forall j \in N_i,$$

for all the rounds after it has been run the last time, until (and including) the current time $r = \tau_i(s)$. We
can assume $b_{(i,j)}^r = 0$ for any $j \notin N_i$. The estimates $b_{(i,j)}^r$ can be constructed by the algorithm because $S_{I_r} e_{j_r}$
is precisely the feedback given to the algorithm.

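Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the random variables $\{k_1, I_1, \dots, k_t, I_t\}$. For any $t$, the (conditional)
expectation satisfies

$$\begin{aligned}
\mathbb{E}\big\{ b_{(i,j)}^t \mid \mathcal{F}_{t-1} \big\}
&= v_{(i,j)}^\top \begin{bmatrix} \sum_{k=1}^{N} p_k^t q_k^t(i)\, S_i \\ p_i^t q_i^t(j)\, S_j / q_i^t(j) \end{bmatrix} e_{j_t} \\
&= p_i^t\, v_{(i,j)}^\top S_{(i,j)} e_{j_t} \\
&= p_i^t\, (\ell_j - \ell_i)^\top e_{j_t} \\
&= p_i^t\, (e_j - e_i)^\top L e_{j_t}
\end{aligned} \tag{2}$$

where in the second equality we used the fact that $\sum_{k=1}^{N} p_k^t q_k^t(i) = p_i^t$ by stationarity (1). Thus each
algorithm $\mathcal{A}_i$, on average, has access to unbiased estimates of the loss differences within its neighborhood
set.

Recall that algorithm $\mathcal{A}_i$ is only aware of its neighborhood, and therefore we peg coordinates of $q_i^t$ to
zero outside of $N_i$. However, for convenience, our notation below still employs full $N$-dimensional vectors,
and we keep in mind that only coordinates indexed by $N_i$ are considered and modified by $\mathcal{A}_i$.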
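The unbiasedness property (2) can be verified exactly on a toy instance by enumerating the two-level draw $(k_t, I_t)$. Everything in the Python sketch below is an illustrative assumption: the two-action game, the hand-computed witnesses `v`, the distributions `q`, and the helper `expected_estimate`. The only requirement taken from the text is that `p` satisfies the flow condition for `q`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def expected_estimate(i, j, p, q, S, v, outcome):
    """Exact E[b_(i,j)] over k ~ p and I ~ q[k], for a fixed opponent action."""
    s_i = len(S[i])
    v_top, v_bot = v[:s_i], v[s_i:]              # blocks acting on S_i and S_j
    col_i = [row[outcome] for row in S[i]]       # S_i e_outcome
    col_j = [row[outcome] for row in S[j]]       # S_j e_outcome
    total = 0.0
    for k in range(len(p)):
        for I in range(len(p)):
            b = 0.0
            if I == i:                           # first block: I{I = i} S_i
                b += dot(v_top, col_i)
            if k == i and I == j:                # second block, importance weighted
                b += dot(v_bot, col_j) / q[i][j]
            total += p[k] * q[k][I] * b
    return total


# Hypothetical two-action game in which both actions are neighbors.
L = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0]]
S = [[[1, 0, 0], [0, 1, 0], [0, 0, 1]],          # action 0 reveals the outcome
     [[1, 1, 1]]]                                # action 1 reveals nothing
v = {(0, 1): [1.0, 0.0, -2.0, 0.0],              # l_1 - l_0 = S_(0,1)^T v
     (1, 0): [0.0, -1.0, 0.0, 2.0]}              # l_0 - l_1 = S_(1,0)^T v
q = [[0.6, 0.4], [0.3, 0.7]]                     # q[k] is the distribution q_k
p = [3 / 7, 4 / 7]                               # satisfies p = Q p for this q

gaps = [abs(expected_estimate(i, j, p, q, S, v[(i, j)], c)
            - p[i] * (L[j][c] - L[i][c]))
        for (i, j) in v for c in range(3)]
```

All entries of `gaps` are zero up to rounding. Replacing `p` by a distribution that violates $p = Qp$ breaks the identity for the pair $(0, 1)$, which shows where stationarity enters.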

When invoked for the $s$-th time (that is, $t = \tau_i(s)$), $\mathcal{A}_i$ constructs linear functions (cost estimates)
$f_i^s \in \mathbb{R}^N$ defined by

$$f_i^s = \big[ h_{(i,j)}^s \cdot \mathbb{I}\{j \in N_i\} \big]_{j \in [N]}$$

where

$$h_{(i,j)}^s = \sum_{r=\tau_i(s-1)+1}^{\tau_i(s)} b_{(i,j)}^r.$$

We now show that $f_i^s$ has the same conditional expectation as the actual loss of the meta algorithm
Neighborhood Watch at time $t = \tau_i(s)$. That is, by bounding expected regret of the black-box algorithm
operating on $\{f_i^s\}$, we bound the actual regret suffered by the meta algorithm on the rounds when $\mathcal{A}_i$ was
invoked.

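Lemma 4.1. Consider algorithm $\mathcal{A}_i$. It holds that

$$\mathbb{E}\Big\{ \big(q_i^{\tau_i(s+1)} - e_u\big)^\top L e_{j_{\tau_i(s+1)}} \,\Big|\, \mathcal{F}_{\tau_i(s)} \Big\} = \mathbb{E}\Big\{ f_i^{s+1} \cdot \big(q_i^{\tau_i(s+1)} - e_u\big) \,\Big|\, \mathcal{F}_{\tau_i(s)} \Big\}$$

for any $u \in N_i$.

Proof. Throughout the proof, we drop the subscript $i$ on $\tau_i$ to ease the notation. Note that $q_i^{\tau(s+1)} = q_i^{\tau(s)+1}$
since the distribution is not updated when algorithm $\mathcal{A}_i$ is not invoked. Hence, conditioned on $\mathcal{F}_{\tau(s)}$, the
variable $(q_i^{\tau(s+1)} - e_u)$ can be taken out of the expectation. We therefore need to show that

$$\big(q_i^{\tau(s+1)} - e_u\big) \cdot \mathbb{E}\big\{ L e_{j_{\tau(s+1)}} \mid \mathcal{F}_{\tau(s)} \big\} = \big(q_i^{\tau(s+1)} - e_u\big) \cdot \mathbb{E}\big\{ f_i^{s+1} \mid \mathcal{F}_{\tau(s)} \big\}. \tag{3}$$

First, we can write

$$\begin{aligned}
\mathbb{E}\big\{ h_{(i,j)}^{s+1} \mid \mathcal{F}_{\tau(s)} \big\}
&= \mathbb{E}\Big\{ \sum_{t=\tau(s)+1}^{\tau(s+1)} b_{(i,j)}^t \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \mathbb{E}\Big\{ \sum_{t=\tau(s)+1}^{\infty} b_{(i,j)}^t\, \mathbb{I}\{t \le \tau(s+1)\} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\Big\{ \mathbb{E}\big[ b_{(i,j)}^t\, \mathbb{I}\{t \le \tau(s+1)\} \mid \mathcal{F}_{t-1} \big] \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\Big\{ \mathbb{I}\{t \le \tau(s+1)\} \cdot \mathbb{E}\big[ b_{(i,j)}^t \mid \mathcal{F}_{t-1} \big] \,\Big|\, \mathcal{F}_{\tau(s)} \Big\}.
\end{aligned}$$

The last step follows because the event $\{t \le \tau(s+1)\}$ is $\mathcal{F}_{t-1}$-measurable (that is, variables $k_1, \dots, k_{t-1}$
determine the value of the indicator). By Eq. (2) we conclude

$$\mathbb{E}\big\{ h_{(i,j)}^{s+1} \mid \mathcal{F}_{\tau(s)} \big\} = \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\Big\{ \mathbb{I}\{t \le \tau(s+1)\}\, p_i^t\, (e_j - e_i)^\top L e_{j_t} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\}.$$

Since $\mathbb{I}\{t = \tau(s+1)\} = \mathbb{I}\{k_t = i\}\, \mathbb{I}\{t \le \tau(s+1)\}$, we have

$$\begin{aligned}
\mathbb{E}\big\{ \mathbb{I}\{t = \tau(s+1)\}\, e_{j_t} \mid \mathcal{F}_{\tau(s)} \big\}
&= \mathbb{E}\Big\{ \mathbb{E}\big\{ \mathbb{I}\{k_t = i\}\, \mathbb{I}\{t \le \tau(s+1)\}\, e_{j_t} \mid \mathcal{F}_{t-1} \big\} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \mathbb{E}\Big\{ \mathbb{I}\{t \le \tau(s+1)\}\, e_{j_t}\, \mathbb{E}\big\{ \mathbb{I}\{k_t = i\} \mid \mathcal{F}_{t-1} \big\} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \mathbb{E}\Big\{ \mathbb{I}\{t \le \tau(s+1)\}\, \mathbb{P}(k_t = i \mid \mathcal{F}_{t-1})\, e_{j_t} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \mathbb{E}\Big\{ \mathbb{I}\{t \le \tau(s+1)\}\, p_i^t\, e_{j_t} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\}.
\end{aligned}$$

Combining this with the expression for $\mathbb{E}\{ h_{(i,j)}^{s+1} \mid \mathcal{F}_{\tau(s)} \}$ obtained from Eq. (2),

$$\begin{aligned}
\mathbb{E}\big\{ h_{(i,j)}^{s+1} \mid \mathcal{F}_{\tau(s)} \big\}
&= \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\Big\{ \mathbb{I}\{t \le \tau(s+1)\}\, p_i^t\, (e_j - e_i)^\top L e_{j_t} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\} \\
&= \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\Big\{ \mathbb{I}\{t = \tau(s+1)\}\, (e_j - e_i)^\top L e_{j_t} \,\Big|\, \mathcal{F}_{\tau(s)} \Big\}.
\end{aligned}$$

Observe that coordinates of $q_i^{\tau(s+1)}$ and $e_u$ are zero outside of $N_i$. We then have that

$$\begin{aligned}
\mathbb{E}\big\{ f_i^{s+1} \mid \mathcal{F}_{\tau(s)} \big\}
&= \Big[ \mathbb{I}\{j \in N_i\} \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\big\{ (e_j - e_i)^\top L e_{j_t}\, \mathbb{I}\{t = \tau(s+1)\} \mid \mathcal{F}_{\tau(s)} \big\} \Big]_{j \in [N]} \\
&= \Big[ \mathbb{I}\{j \in N_i\} \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\big\{ e_j^\top L e_{j_t}\, \mathbb{I}\{t = \tau(s+1)\} \mid \mathcal{F}_{\tau(s)} \big\} \Big]_{j \in [N]} - c \cdot \mathbf{1}_{N_i}
\end{aligned} \tag{4}$$

where

$$c = \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\big\{ e_i^\top L e_{j_t}\, \mathbb{I}\{t = \tau(s+1)\} \mid \mathcal{F}_{\tau(s)} \big\}$$

is a scalar. When multiplying the above expression by $q_i^{\tau(s+1)} - e_u$, the term $c\,\mathbf{1}_{N_i}$ vanishes. Thus, minimizing
regret with relative costs (with respect to the $i$th action) is the same as minimizing regret with the absolute
costs. We conclude that

$$\begin{aligned}
\big(q_i^{\tau(s+1)} - e_u\big) \cdot \mathbb{E}\big\{ f_i^{s+1} \mid \mathcal{F}_{\tau(s)} \big\}
&= \big(q_i^{\tau(s+1)} - e_u\big) \cdot \Big[ \mathbb{I}\{j \in N_i\} \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\big\{ e_j^\top L e_{j_t}\, \mathbb{I}\{t = \tau(s+1)\} \mid \mathcal{F}_{\tau(s)} \big\} \Big]_{j \in [N]} \\
&= \big(q_i^{\tau(s+1)} - e_u\big) \cdot \sum_{t=\tau(s)+1}^{\infty} \mathbb{E}\big\{ L e_{j_t}\, \mathbb{I}\{t = \tau(s+1)\} \mid \mathcal{F}_{\tau(s)} \big\} \\
&= \big(q_i^{\tau(s+1)} - e_u\big) \cdot \mathbb{E}\big\{ L e_{j_{\tau(s+1)}} \mid \mathcal{F}_{\tau(s)} \big\}.
\end{aligned}$$

□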
4.3 Regret Analysis 

For each algorithm $\mathcal{A}_i$, the estimates $f_i^s$ are passed to a full-information black box algorithm which works
only on the coordinates $N_i$. From the point of view of the full-information black box, the game has length
$T_i = \max\{s : \tau_i(s) \le T\}$, the (random) number of times algorithm $\mathcal{A}_i$ has been invoked within $T$ rounds.

We proceed similarly to [1]: we use a full-information online convex optimization procedure with an
entropy regularizer (also known as the Exponential Weights Algorithm) which receives the vector $f_i^s$ and
returns the next mixed strategy $x_i^{s+1} \in \Delta_N$ (in fact, effectively in $\Delta_{|N_i|}$). We then define

$$q_i^{t+1} = (1-\gamma)\, x_i^{s+1} + (\gamma / |N_i|)\, \mathbf{1}_{N_i}$$

where $\gamma$ is to be specified later. Since $\mathcal{A}_i$ is run at time $t$, we have $\tau_i(s) = t$ by definition. The next time $\mathcal{A}_i$
is active (that is, at time $\tau_i(s+1)$), the action $I_{\tau_i(s+1)}$ will be played as a random draw from $q_i^{t+1} = q_i^{\tau_i(s+1)}$;
that is, the distribution is not modified on the interval $\{\tau_i(s)+1, \dots, \tau_i(s+1)\}$.
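The update of a local algorithm (an Exponential Weights step followed by mixing with the uniform distribution on the neighbor set) can be sketched in Python as follows. The function names, the learning rate, and the numbers are assumptions made for illustration; the sketch covers only the update of $x$ and $q$, not the construction of the cost estimates.

```python
import math


def exp_weights(cum_costs, neighbors, eta):
    """x(j) proportional to exp(-eta * cum_costs[j]) on `neighbors`, zero elsewhere."""
    shift = min(cum_costs[j] for j in neighbors)          # numerical stability
    w = {j: math.exp(-eta * (cum_costs[j] - shift)) for j in neighbors}
    z = sum(w.values())
    return [w[j] / z if j in w else 0.0 for j in range(len(cum_costs))]


def mix_with_uniform(x, neighbors, gamma):
    """q = (1 - gamma) x + (gamma / |N_i|) 1_{N_i}."""
    u = 1.0 / len(neighbors)
    return [(1 - gamma) * x[j] + gamma * u if j in neighbors else 0.0
            for j in range(len(x))]


# Hypothetical numbers: N = 4 actions, neighbor set N_i = {0, 1, 3}.
neighbors = [0, 1, 3]
cum_costs = [2.0, -1.0, 5.0, 0.5]      # running sum of the cost estimates f_i^s
x = exp_weights(cum_costs, neighbors, eta=0.5)
q = mix_with_uniform(x, neighbors, gamma=0.1)
```

Both `x` and `q` vanish outside the neighbor set, every neighbor receives probability at least $\gamma / |N_i|$ under `q`, and $x(j) \le 2 q(j)$ whenever $\gamma \le 1/2$, a fact used later when bounding the variance term.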

We prove Theorem 4.1 by a series of lemmas. The first one is a direct consequence of an external
regret bound for a Follow the Regularized Leader (FTRL) algorithm in terms of local norms [1]. For a
strictly convex "regularizer" $F$, the local norm $\|\cdot\|_x$ is defined by $\|z\|_x = \sqrt{z^\top \nabla^2 F(x)\, z}$ and its dual is
$\|z\|_x^* = \sqrt{z^\top \nabla^2 F(x)^{-1} z}$.

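Lemma 4.2. The full-information algorithm utilized by $\mathcal{A}_i$ has an upper bound

$$\mathbb{E}\Big\{ \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - e_{\phi(i)}\big) \Big\} \le \eta\, \mathbb{E}\Big\{ \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 \Big\} + \eta^{-1} \log N + T \gamma \bar{\ell}$$

on its external regret, where $\phi(i) \in N_i$ is any neighbor of $i$, $\bar{\ell} = \max_{i,j} L_{i,j}$, and $\eta$ is a learning rate parameter
to be tuned later.

Proof. Since our decision space is a simplex, it is natural to use the (negative) entropy regularizer, in which
case FTRL is the same as the Exponential Weights Algorithm. From [1, Thm 2.1], for any comparator $u$
with zero support outside $N_i$, the following regret guarantee holds:

$$\sum_{s=1}^{T_i} f_i^s \cdot (x_i^s - u) \le \eta \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 + \eta^{-1} \log(|N_i|).$$

An easy calculation shows that in the case of the entropy regularizer $F$, the Hessian is $\nabla^2 F(x) = \operatorname{diag}(x_1^{-1}, x_2^{-1}, \dots, x_N^{-1})$
and $\nabla^2 F(x)^{-1} = \operatorname{diag}(x_1, x_2, \dots, x_N)$. We refer to [1] for more details.

Let $\phi : \{1, \dots, N\} \to \{1, \dots, N\}$ be a local departure function (see Definition 3.1). We can then write a
regret guarantee

$$\sum_{s=1}^{T_i} f_i^s \cdot \big(x_i^s - e_{\phi(i)}\big) \le \eta \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 + \eta^{-1} \log(|N_i|).$$

Since, in fact, we play according to a slightly modified version $q_i^{\tau_i(s)}$ of $x_i^s$, it holds that

$$\sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - e_{\phi(i)}\big) \le \eta \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 + \eta^{-1} \log(|N_i|) + \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - x_i^s\big).$$

Taking expectations of both sides and upper bounding $|N_i|$ by $N$,

$$\mathbb{E}\Big\{ \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - e_{\phi(i)}\big) \Big\} \le \eta\, \mathbb{E}\Big\{ \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 \Big\} + \eta^{-1} \log N + \mathbb{E}\Big\{ \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - x_i^s\big) \Big\}.$$

A proof identical to that of Lemma 4.1 gives

$$\mathbb{E}\Big\{ f_i^s \cdot \big(q_i^{\tau_i(s)} - x_i^s\big) \,\Big|\, \mathcal{F}_{\tau_i(s-1)} \Big\} = \mathbb{E}\Big\{ \big(q_i^{\tau_i(s)} - x_i^s\big)^\top L e_{j_{\tau_i(s)}} \,\Big|\, \mathcal{F}_{\tau_i(s-1)} \Big\} \le \gamma \bar{\ell}$$

for the last term, where $\bar{\ell}$ is the upper bound on the magnitude of entries of $L$. Putting everything together,

$$\mathbb{E}\Big\{ \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - e_{\phi(i)}\big) \Big\} \le \eta\, \mathbb{E}\Big\{ \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 \Big\} + \eta^{-1} \log N + T \gamma \bar{\ell}$$

where we have upper bounded $T_i$ by $T$. □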
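For the entropy regularizer, the dual local norm appearing in Lemma 4.2 has the explicit form $(\|z\|_x^*)^2 = \sum_j x_j z_j^2$, since the inverse Hessian is $\operatorname{diag}(x_1, \dots, x_N)$. The short Python sketch below evaluates it on made-up vectors; the function names and numbers are assumptions for illustration only.

```python
def dual_local_norm_sq(z, x):
    """(||z||_x^*)^2 = z^T diag(x) z for the negative-entropy regularizer."""
    return sum(xj * zj * zj for xj, zj in zip(x, z))


def euclidean_norm_sq(z):
    return sum(zj * zj for zj in z)


# Hypothetical point of the simplex and cost vector.
x = [0.5, 0.3, 0.2]
z = [1.0, -2.0, 4.0]
dual_sq = dual_local_norm_sq(z, x)      # 0.5 * 1 + 0.3 * 4 + 0.2 * 16
eucl_sq = euclidean_norm_sq(z)          # 1 + 4 + 16
```

Because every coordinate of `x` is at most 1, the dual local norm never exceeds the Euclidean norm, which is the comparison used in the variance analysis below.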

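As with many bandit-type problems, effort is required to show that the variance term is controlled. This
is the subject of the next lemma.

Lemma 4.3. The variance term in the bound of Lemma 4.2 is upper bounded as

$$\sum_{i=1}^{N} \mathbb{E}\Big\{ \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 \Big\} \le 24 v^2 N T.$$

Proof. First, fix an $i \in [N]$ and consider the term $\mathbb{E}\big\{ \sum_{s=1}^{T_i} (\|f_i^s\|_{x_i^s}^*)^2 \big\}$. Until the last step of the proof, we
will sometimes omit $i$ from the notation.

We start by observing that $f_i^s$ is a sum of $\tau(s) - \tau(s-1) - 1$ terms of the type $v_{(i,j)}^\top S_i e_{j_r}$ (that is, of
constant magnitude) and one term of the type $v_{(i,j)}^\top S_j e_{j_r} / q_i^r(j)$. In controlling $\|f_i^s\|_{x^s}^*$, we therefore have two
difficulties: controlling the number of constant-size terms and making sure the last term does not explode
due to division by a small probability $q_i^r(j)$. The former is solved below by a careful argument, while
the latter problem is solved according to usual bandit-style arguments.

More precisely, we can write $f_i^s = g^{\tau_i(s-1)} + h^{\tau_i(s)}$ where the vectors $g^{\tau_i(s-1)}, h^{\tau_i(s)} \in \mathbb{R}^N$ are defined as

$$g^{\tau_i(s-1)}(j) = \sum_{r=\tau_i(s-1)}^{\tau_i(s)-1} \mathbb{I}\{I_r = i\}\, v_{(i,j)}^\top S_i e_{j_r}$$

and

$$h^{\tau_i(s)}(j) = \mathbb{I}\{I_{\tau_i(s)} = j\}\, v_{(i,j)}^\top S_j e_{j_{\tau_i(s)}} \big/ q_i^{\tau_i(s)}(j).$$

Then

$$\big(\|f_i^s\|_{x^s}^*\big)^2 = \big(\|g^{\tau_i(s-1)} + h^{\tau_i(s)}\|_{x^s}^*\big)^2 \le 2\big(\|g^{\tau_i(s-1)}\|_{x^s}^*\big)^2 + 2\big(\|h^{\tau_i(s)}\|_{x^s}^*\big)^2.$$

We will bound each of the two terms separately, in expectation. For the second term,

$$\big(\|h^{\tau_i(s)}\|_{x^s}^*\big)^2 = x^s(I_\tau) \Big( v_{(i,I_\tau)}^\top S_{I_\tau} e_{j_\tau} \big/ q_i^\tau(I_\tau) \Big)^2 \le x^s(I_\tau) \big( v / q_i^\tau(I_\tau) \big)^2$$

where $\tau = \tau_i(s)$. Since $q_i^{\tau_i(s)} = (1-\gamma)\, x^s + (\gamma/|N_i|)\, \mathbf{1}_{N_i}$, it is easy to verify that $x^s(I_\tau)/q_i^\tau(I_\tau) \le 2$ (whenever
$\gamma < 1/2$) and thus

$$\big(\|h^{\tau_i(s)}\|_{x^s}^*\big)^2 \le 2 v^2 / q_i^\tau(I_\tau).$$

The remaining division by the probability disappears under the expectation:

$$\mathbb{E}\Big\{ \big(\|h^{\tau_i(s)}\|_{x^s}^*\big)^2 \,\Big|\, k_1, I_1, \dots, k_{\tau_i(s)} \Big\} \le 2 v^2 \sum_{j \in N_i} q_i^{\tau_i(s)}(j) \big/ q_i^{\tau_i(s)}(j) = 2 |N_i| v^2 \le 2 N v^2. \tag{5}$$

Consider now the first term. As discussed in the proof of Lemma 4.2, the inverse Hessian of the entropy
function shrinks each coordinate $j$ precisely by $x^s(j) \le 1$, implying that the local norm is dominated by the
Euclidean norm:

$$\|g^{\tau_i(s-1)}\|_{x^s}^* \le \|g^{\tau_i(s-1)}\|_2.$$

It is therefore enough to upper bound $\mathbb{E}\big\{ \|g^{\tau_i(s-1)}\|_2^2 \big\}$. The idea of the proof is the following. Observe
that $\mathbb{P}(k_t = i \mid \mathcal{F}_{t-1}) = \mathbb{P}(I_t = i \mid \mathcal{F}_{t-1})$. Conditioned on the event that either $k_t = i$ or $I_t = i$, each of the two
possibilities has probability $1/2$ of occurring. Note that $g^{\tau_i(s-1)}$ inflates every time $k_t \neq i$, yet $I_t = i$ occurs.
It is then easy to see that the magnitude of $g^{\tau_i(s-1)}$ is unlikely to get large before algorithm $\mathcal{A}_i$ is run again. We
now make this intuition precise.

The function $g^t$ is presently defined only for those time steps when $t = \tau_i(s)$ for some $s$ (that is, when
the algorithm $\mathcal{A}_i$ is invoked). We extend this definition as follows. Let the $j$th coordinate of $g^t$ be defined as

$$g^t(j) = \sum_{r=t}^{\pi_i(t+1)-1} \mathbb{I}\{I_r = i\}\, v_{(i,j)}^\top S_i e_{j_r}$$

for $j \in N_i$ and $0$ otherwise. The function $g^t$ can be thought of as accumulating partial pieces on rounds
when $I_t = i$ until $k_t = i$ occurs. Let us now define an analogue of $\tau$ and $\pi$ for the event that either $I_t = i$ or
$k_t = i$:

$$\gamma_i(s) = \min\Big\{ t : s = \sum_{r=1}^{t} \mathbb{I}\{k_r = i \text{ or } I_r = i\} \Big\}.$$

Further, for any $t$, let

$$\nu_i(t) = \min\{ t' \ge t : k_{t'} = i \text{ or } I_{t'} = i \},$$

the next occurrence of the event $\{k_\tau = i \text{ or } I_\tau = i\}$ on or after $t$. Let

$$\mathbf{I} = \mathbb{I}\{\nu_i(t) \neq \pi_i(t)\}$$

be the indicator of the event that the first time after $t$ that $\{k_\tau = i \text{ or } I_\tau = i\}$ occurred it was also the case
that the algorithm was not run (i.e. $k_\tau \neq i$). Note that $g^t(j)$ can now be written recursively as

$$g^t(j) = \mathbf{I} \cdot \Big[ v_{(i,j)}^\top S_i e_{j_{\nu_i(t)}} + g^{\nu_i(t)+1}(j) \Big].$$

As argued before, $\mathbb{P}(\mathbf{I} = 1 \mid \mathcal{F}_{t-1}) = 1/2$. We will now show that $\mathbb{E}\{ g^t(j) \mid \mathcal{F}_{t-1} \} \le 2v$ by the following
inductive argument, whose base case trivially holds for $t = T$:

$$\begin{aligned}
\mathbb{E}\{ g^t(j) \mid \mathcal{F}_{t-1} \}
&= \mathbb{E}\Big\{ \mathbb{E}\Big\{ \mathbf{I} \cdot \big[ v_{(i,j)}^\top S_i e_{j_{\nu_i(t)}} + g^{\nu_i(t)+1}(j) \big] \,\Big|\, \mathcal{F}_{\nu_i(t)} \Big\} \,\Big|\, \mathcal{F}_{t-1} \Big\} \\
&\le v + \mathbb{E}\big\{ \mathbf{I}\, g^{\nu_i(t)+1}(j) \mid \mathcal{F}_{t-1} \big\} \\
&= v + \mathbb{E}\Big\{ \mathbf{I}\, \underbrace{\mathbb{E}\big\{ g^{\nu_i(t)+1}(j) \mid \mathcal{F}_{\nu_i(t)} \big\}}_{\le 2v \text{ by induction}} \,\Big|\, \mathcal{F}_{t-1} \Big\} \\
&\le v + \mathbb{E}\{ \mathbf{I} \mid \mathcal{F}_{t-1} \}\, 2v \\
&\le v + (1/2)\, 2v = 2v.
\end{aligned}$$

The expected value of $(g^t(j))^2$ can be controlled in a similar manner. To ease the notation, let $z =
v_{(i,j)}^\top S_i e_{j_{\nu_i(t)}}$. Using the upper bound for the conditional expectation of $g^{\nu_i(t)+1}(j)$ calculated above,

$$\begin{aligned}
\mathbb{E}\{ (g^t(j))^2 \mid \mathcal{F}_{t-1} \}
&= \mathbb{E}\Big\{ \mathbf{I} \cdot \big( z^2 + (g^{\nu_i(t)+1}(j))^2 + 2 z\, g^{\nu_i(t)+1}(j) \big) \,\Big|\, \mathcal{F}_{t-1} \Big\} \\
&= \mathbb{E}\Big\{ \mathbf{I} z^2 + \mathbf{I}\, \mathbb{E}\big\{ (g^{\nu_i(t)+1}(j))^2 \mid \mathcal{F}_{\nu_i(t)} \big\} + 2\, \mathbf{I} z\, \mathbb{E}\big\{ g^{\nu_i(t)+1}(j) \mid \mathcal{F}_{\nu_i(t)} \big\} \,\Big|\, \mathcal{F}_{t-1} \Big\} \\
&\le 5 v^2 + \mathbb{E}\Big\{ \mathbf{I}\, \mathbb{E}\big\{ (g^{\nu_i(t)+1}(j))^2 \mid \mathcal{F}_{\nu_i(t)} \big\} \,\Big|\, \mathcal{F}_{t-1} \Big\}.
\end{aligned}$$

The argument now proceeds with backward induction exactly as above. We conclude that

$$\mathbb{E}\{ (g^t(j))^2 \mid \mathcal{F}_{t-1} \} \le 10 v^2$$

and, hence,

$$\mathbb{E}\big\{ \|g^{\tau_i(s-1)}\|_2^2 \big\} \le 10 N v^2.$$

Together with (5), we conclude that

$$\mathbb{E}\big\{ (\|f_i^s\|_{x^s}^*)^2 \big\} \le 2 \big( 2 N v^2 + 10 N v^2 \big) = 24 v^2 N.$$

Summing over $t = 1, \dots, T$ and observing that only one algorithm is run at any time $t$ proves the statement.

□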
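The two backward inductions in the proof above have the same shape: a bound $u$ at a later time yields the bound $c + u/2$ at an earlier time, with $c = v$ for the first moment and $c = 5v^2$ for the second. The Python sketch below simply iterates this recursion to show that the values $2v$ and $10v^2$ are its fixed points. The function name, the starting value $0$ for the base case, and the value of `v` are assumptions for illustration.

```python
def iterate_bound(const, steps, start=0.0):
    """Iterate u <- const + u / 2, the recursion produced by P(I = 1) = 1/2."""
    u = start
    for _ in range(steps):
        u = const + 0.5 * u
    return u


v = 1.5                                               # hypothetical value of v
first_moment = iterate_bound(v, steps=60)             # approaches 2 v
second_moment = iterate_bound(5 * v * v, steps=60)    # approaches 10 v^2
one_step = iterate_bound(v, steps=1, start=2 * v)     # induction step: v + (1/2) 2v
```

Starting from $0$ the iterates increase towards the fixed point and never exceed it, while starting from $2v$ a single step returns $2v$, which is exactly the induction step in the proof.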

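Proof of Theorem 4.1. The flow condition $p^t = Q^t p^t$ comes in crucially in several places throughout the
proofs, and the next argument is one of them. Observe that

$$\mathbb{E}\{ e_{\phi(I_t)} \mid \mathcal{F}_{t-1} \} = \sum_{k=1}^{N} \sum_{i=1}^{N} p_k^t q_k^t(i)\, e_{\phi(i)} = \sum_{i=1}^{N} e_{\phi(i)} \sum_{k=1}^{N} p_k^t q_k^t(i) = \sum_{i=1}^{N} p_i^t\, e_{\phi(i)} = \mathbb{E}\{ e_{\phi(k_t)} \mid \mathcal{F}_{t-1} \}$$

and thus

$$\begin{aligned}
\mathbb{E}\Big\{ \sum_{t=1}^{T} e_{\phi(I_t)}^\top L e_{j_t} \Big\}
&= \mathbb{E}\Big\{ \sum_{t=1}^{T} \mathbb{E}\{ e_{\phi(I_t)} \mid \mathcal{F}_{t-1} \}^\top L e_{j_t} \Big\} \\
&= \mathbb{E}\Big\{ \sum_{t=1}^{T} \mathbb{E}\{ e_{\phi(k_t)} \mid \mathcal{F}_{t-1} \}^\top L e_{j_t} \Big\} \\
&= \mathbb{E}\Big\{ \sum_{t=1}^{T} e_{\phi(k_t)}^\top L e_{j_t} \Big\}.
\end{aligned}$$

It is because of this equality that external regret with respect to the local neighborhood can be turned into
local internal regret. We have that

$$\begin{aligned}
\mathbb{E}\Big\{ \sum_{t=1}^{T} (e_{I_t} - e_{\phi(I_t)})^\top L e_{j_t} \Big\}
&= \mathbb{E}\Big\{ \sum_{t=1}^{T} (e_{I_t} - e_{\phi(k_t)})^\top L e_{j_t} \Big\} \\
&= \mathbb{E}\Big\{ \sum_{t=1}^{T} (q_{k_t}^t - e_{\phi(k_t)})^\top L e_{j_t} \Big\} \\
&= \sum_{i=1}^{N} \mathbb{E}\Big\{ \sum_{t=1}^{T} \mathbb{I}\{k_t = i\}\, (q_i^t - e_{\phi(i)})^\top L e_{j_t} \Big\}.
\end{aligned}$$

By Lemma 4.1,

$$\sum_{i=1}^{N} \mathbb{E}\Big\{ \sum_{t=1}^{T} \mathbb{I}\{k_t = i\}\, (q_i^t - e_{\phi(i)})^\top L e_{j_t} \Big\} = \sum_{i=1}^{N} \mathbb{E}\Big\{ \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - e_{\phi(i)}\big) \Big\}$$

and so by Lemma 4.2

$$\sum_{i=1}^{N} \mathbb{E}\Big\{ \sum_{s=1}^{T_i} f_i^s \cdot \big(q_i^{\tau_i(s)} - e_{\phi(i)}\big) \Big\} \le \sum_{i=1}^{N} \Big( \eta\, \mathbb{E}\Big\{ \sum_{s=1}^{T_i} \big(\|f_i^s\|_{x_i^s}^*\big)^2 \Big\} + \eta^{-1} \log N + T \gamma \bar{\ell} \Big).$$

With the help of Lemma 4.3,

$$\mathbb{E}\Big\{ \sum_{t=1}^{T} (e_{I_t} - e_{\phi(I_t)})^\top L e_{j_t} \Big\} \le \eta\, 24 v^2 N T + N \big( \eta^{-1} \log N + T \gamma \bar{\ell} \big) = 4 N v \sqrt{6 (\log N) T} + T N \gamma \bar{\ell}$$

for the setting of $\eta = \sqrt{\dfrac{\log N}{24 v^2 T}}$.

We remark that for the purposes of "in expectation" bounds, we can simply set $\gamma = 0$ and still get $O(\sqrt{T})$
guarantees (see [1]). This point is obscured by the fact that the original algorithm of Auer et al [2] uses
the same parameter for the learning rate $\eta$ and exploration $\gamma$. If these are separated, the "in expectation"
analysis of [2] can also be done with $\gamma = 0$. However, to prove high probability bounds on regret, a setting of
$\gamma \propto T^{-1/2}$ is required. Using the techniques in [1], the high-probability extension of results in this paper is
straightforward (tails for the terms $\|g^{\tau_i(s-1)}\|_2^2$ in Lemma 4.3 can be controlled without much difficulty). □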
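The choice of the learning rate can be checked numerically: minimizing $\eta\, 24 v^2 N T + N \eta^{-1} \log N$ over $\eta$ gives $\eta = \sqrt{\log N / (24 v^2 T)}$ and the value $4 N v \sqrt{6 (\log N) T}$. The Python sketch below is only a sanity check of this arithmetic; the function names and the problem sizes are assumptions chosen for illustration.

```python
import math


def tuned_eta(N, v, T):
    """Minimizer of eta * 24 v^2 N T + N * log(N) / eta over eta > 0."""
    return math.sqrt(math.log(N) / (24.0 * v * v * T))


def regret_bound(eta, N, v, T, gamma=0.0, ell_bar=1.0):
    """Right-hand side eta * 24 v^2 N T + N (log(N) / eta + T gamma ell_bar)."""
    return eta * 24.0 * v * v * N * T + N * (math.log(N) / eta + T * gamma * ell_bar)


def closed_form(N, v, T):
    """4 N v sqrt(6 (log N) T), the bound of Theorem 4.1 (gamma = 0)."""
    return 4.0 * N * v * math.sqrt(6.0 * math.log(N) * T)


N, v, T = 5, 2.0, 10_000                  # hypothetical problem sizes
eta = tuned_eta(N, v, T)
bound = regret_bound(eta, N, v, T)
```

With $\gamma = 0$ the tuned bound matches the closed form, and moving `eta` away from the tuned value in either direction increases the right-hand side.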



5 Random Signals 

We now briefly consider the setting of partial monitoring with random signals, studied by Rustichini [11],
Lugosi, Mannor, and Stoltz [5], and Perchet [5]. Without much modification of the above arguments, the
local observability condition yet again yields $O(\sqrt{T})$ internal regret.

Suppose that instead of receiving deterministic feedback $H_{i,j}$, the decision maker now receives a random
signal $d_{i,j}$ drawn according to the distribution $H_{i,j} \in \Delta(\Sigma)$ over the signals. In the problem of deterministic
feedback studied in the paper so far, the signal $H_{i,j} = \sigma$ was identified with the Dirac distribution $\delta_\sigma$.

Given the matrix $H$ of distributions on $\Sigma$, we can construct, for each row $i$, a matrix $\Xi_i \in \mathbb{R}^{s_i \times M}$ as

$$\Xi_i(k,j) = H_{i,j}(\sigma_k)$$

where the set $\sigma_1, \dots, \sigma_{s_i}$ is the union of supports of $H_{i,1}, \dots, H_{i,M}$. Columns of $\Xi_i$ are now distributions over
signals. Given the actions $I_t$ and $j_t$ of the player and the opponent, the feedback provided to the player can
be equivalently written as $S_{I_t}^t e_{j_t}$ where each column $r$ of the random matrix $S_i^t \in \mathbb{R}^{s_i \times M}$ is a standard unit
vector drawn independently according to the distribution given by the column $r$ of $\Xi_i$. Hence, $\mathbb{E} S_i^t = \Xi_i$.

As before, the matrix $\Xi_{(i,j)}$ is constructed by stacking $\Xi_i$ on top of $\Xi_j$. The local observability condition,
adapted to the case of random signals, can now be stated as:

$$\ell_j - \ell_i \in \operatorname{Im} \Xi_{(i,j)}^\top$$

for all neighboring actions $i, j$.

Let us specify the few places where the analysis slightly differs from the arguments of the paper. Since
we now have an extra (independent) source of randomness, we define $\mathcal{F}_t$ to be the $\sigma$-algebra generated by
the random variables $\{k_1, I_1, S^1, \dots, k_t, I_t, S^t\}$ where $S^t$ is the random matrix obtained by stacking all $S_i^t$.
We now define the estimates

$$b_{(i,j)}^r = v_{(i,j)}^\top \begin{bmatrix} \mathbb{I}\{I_r = i\}\, S_i^r \\ \mathbb{I}\{k_r = i\}\,\mathbb{I}\{I_r = j\}\, S_j^r / q_i^r(j) \end{bmatrix} e_{j_r}, \qquad \forall r \in \{\tau_i(s-1)+1, \dots, \tau_i(s)\},\ \forall j \in N_i,$$

with the only modification that $S_i^r$ and $S_j^r$ are now random variables. Equation (2) now reads

$$\begin{aligned}
\mathbb{E}\big\{ b_{(i,j)}^t \mid \mathcal{F}_{t-1} \big\}
&= v_{(i,j)}^\top \begin{bmatrix} \sum_{k=1}^{N} p_k^t q_k^t(i)\, \Xi_i \\ p_i^t q_i^t(j)\, \Xi_j / q_i^t(j) \end{bmatrix} e_{j_t} \\
&= p_i^t\, v_{(i,j)}^\top \Xi_{(i,j)} e_{j_t} \\
&= p_i^t\, (e_j - e_i)^\top L e_{j_t}.
\end{aligned} \tag{6}$$

The rest of the analysis follows as in Section 4.3, with $\Xi$ in place of $S$.